AMENDMENT TO THE CLAIMS 



1. (Currently Amended) A computer readable storage media storing instructions readable by
a computer which, when implemented, cause the computer to perform a method comprising: 
with a processor; 

segmenting a sentence of Chinese characters into constituent Chinese words having one
or more Chinese characters by performing a Forward Maximum Matching (FMM)
segmentation of the input sentence and a Backward Maximum Matching (BMM)
segmentation of the input sentence;

tokenizing the sentence of characters into known characters and at least one
overlapping ambiguity string,

wherein the overlapping ambiguity string comprises at least three Chinese characters 
having at least two possible segmentations, wherein each possible segmentation 
comprises a right portion and a left portion and wherein the left portion and the 
right portion of each of the possible segmentations remain in the
tokenized corpus and the at least one overlapping ambiguity string is removed 
from the tokenized corpus; 

obtaining probability information relating to context for each possible segmentation of the 
at least three Chinese characters, wherein the probability information is based on
at least one context feature adjacent one of the right portion or left portion of each
of the possible segmentations, and wherein the at least one context feature
comprises a Chinese character; and 

outputting an indication for selecting one of the at least two possible segmentations as a 
function of the obtained probability information. 
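
Illustrative sketch, not part of the claims: the FMM and BMM segmentations recited in claim 1 can be pictured as greedy longest-match scans from the left and from the right of the sentence. The function names, the toy lexicon, the maximum word length, and the single-character fallback for unmatched characters are all assumptions made for this example.

```python
# Toy lexicon (hypothetical); a real system would load a full dictionary.
LEXICON = {"研究", "研究生", "生命", "命", "的", "起源"}
MAX_WORD_LEN = 3  # longest entry in the toy lexicon


def fmm(sentence, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Forward Maximum Matching: scan left to right, longest match first."""
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def bmm(sentence, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Backward Maximum Matching: scan right to left, longest match first."""
    tokens, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = sentence[j - size:j]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the sentence
    return tokens


if __name__ == "__main__":
    text = "研究生命的起源"
    print(fmm(text))  # ['研究生', '命', '的', '起源']
    print(bmm(text))  # ['研究', '生命', '的', '起源']
```

In this example the two scans disagree only on the span 研究生命, which is the kind of string the claim calls an overlapping ambiguity string.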

2. (Previously Presented) The computer readable storage media of claim 1, wherein 
obtaining the probability information comprises obtaining probability information from a 
language model. 



3. (Previously Presented) The computer readable storage media of claim 2 wherein the 
language model comprises a trigram model. 

4. (Previously Presented) The computer readable storage media of claim 2 wherein 
outputting an indication for selecting one of the at least two possible segmentations comprises 
classifying the probability information. 

5. (Previously Presented) The computer readable storage media of claim 4 wherein 
classifying comprises classifying using Naive Bayesian Classification. 

6. (Canceled) 

7. (Previously Presented) The computer readable storage media of claim 1 wherein 
recognizing the overlapping ambiguity string comprises recognizing a possible segmentation O_f of
the overlapping ambiguity string from the FMM segmentation and a possible segmentation O_b of
the overlapping ambiguity string from the BMM segmentation.

8. (Previously Presented) The computer readable storage media of claim 7, wherein
outputting the indication comprises selecting one of the at least two possible segmentations as a 
function of a set of context features surrounding the overlapping ambiguity string. 

9. (Previously Presented) The computer readable storage media of claim 8 wherein the set of 
context features comprises words or grammatical features surrounding the overlapping ambiguity 
string. 

10. (Previously Presented) The computer readable storage media of claim 8, wherein 
outputting the indication comprises classifying the probability information of the set of context 
features and O_f.



11. (Previously Presented) The computer readable storage media of claim 8, wherein 
outputting the indication comprises classifying the probability information of the set of context 
features and O_b.

12. (Previously Presented) The computer readable storage media of claim 8, wherein outputting the
indication comprises determining which of O_f or O_b has a higher probability as a function of
the set of context features. 

13. (Cancelled) 

14. (Currently Amended) A method of segmentation of a sentence of Chinese text, the 
sentence having an overlapping ambiguity string, the method comprising: 

with a processor; 

generating a first set of tokens utilizing a Forward Maximum Matching (FMM)
segmentation of the sentence;

generating a second set of tokens utilizing a Backward Maximum Matching (BMM)
segmentation of the sentence;

comparing the first set of tokens and the second set of tokens to determine common
tokens and differing sets having the same number of tokens where each set
contains at least three tokens;

recognizing the differing sets of at least three tokens as an overlapping ambiguity string;

determining at least two different pairs of constituent lexical words in the overlapping
ambiguity string;

retaining the at least two different pairs of constituent lexical words in the tokenized 
sentence and removing the overlapping ambiguity string from the tokenized 
sentence; 



obtaining probability information related to context based on at least one context feature
surrounding the at least two pairs of constituent lexical words, wherein the at
least one context feature comprises a Chinese character; and

outputting an indication for selecting one of the at least two pairs of constituent lexical 
words obtained from the FMM segmentation and the BMM segmentation as a 
function of obtained context probability information. 
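
Illustrative sketch, not part of the claims: the comparing and recognizing steps of claim 14 can be pictured as walking the FMM and BMM token lists in parallel and collecting every stretch where they disagree. The function name and the returned tuple layout are assumptions for this example, and the sketch does not enforce the size conditions recited in the claim.

```python
def differing_spans(fmm_tokens, bmm_tokens):
    """Return (fmm_part, bmm_part) pairs covering the same characters.

    Both token lists must segment the same sentence. Tokens on which the
    two segmentations agree are skipped as common tokens.
    """
    spans = []
    i = j = 0
    while i < len(fmm_tokens) and j < len(bmm_tokens):
        if fmm_tokens[i] == bmm_tokens[j]:
            i += 1  # common token: both segmentations agree here
            j += 1
            continue
        # Disagreement: grow both sides until they cover the same characters.
        f_part, b_part = [fmm_tokens[i]], [bmm_tokens[j]]
        f_len, b_len = len(fmm_tokens[i]), len(bmm_tokens[j])
        i += 1
        j += 1
        while f_len != b_len:
            if f_len < b_len:
                f_part.append(fmm_tokens[i])
                f_len += len(fmm_tokens[i])
                i += 1
            else:
                b_part.append(bmm_tokens[j])
                b_len += len(bmm_tokens[j])
                j += 1
        spans.append((f_part, b_part))
    return spans


if __name__ == "__main__":
    f = ["研究生", "命", "的", "起源"]   # FMM output for 研究生命的起源
    b = ["研究", "生命", "的", "起源"]   # BMM output for the same sentence
    for f_part, b_part in differing_spans(f, b):
        print("".join(f_part), f_part, b_part)
```

Each returned pair holds the two candidate segmentations of one ambiguous span, so the later steps only need to score the pair rather than the whole sentence.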

15. (Previously Presented) The method of claim 14 wherein outputting includes 
selecting one of the FMM segmentation of the overlapping ambiguity string and the BMM 
segmentation of the overlapping ambiguity string based on higher probability. 

16. (Previously Presented) The method of claim 15 wherein obtaining probability 
information comprises using an N-gram model. 

17. (Previously Presented) The method of claim 16 wherein obtaining probability 
information comprises obtaining probability information about a first word of the overlapping 
ambiguity string. 

18. (Previously Presented) The method of claim 16, wherein obtaining probability 
information comprises using probability information about a last word of the overlapping 
ambiguity string. 

19. (Previously Presented) The method of claim 16, wherein obtaining probability 
information comprises using the N-gram model that includes probability information for context 
words surrounding the overlapping ambiguity string. 



20. (Previously Presented) The method of claim 16, wherein using the N-gram model 
comprises using trigram probability information about a string of words comprising a first word 
of the overlapping ambiguity string and two context words to the left of the first word. 

21. (Previously Presented) The method of claim 16, wherein using the N-gram model 
comprises using trigram probability information about a string of words comprising a last word 
of the overlapping ambiguity string and two context words to the right of the last word. 
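
Illustrative sketch, not part of the claims: the trigram probability information referred to in claims 20 and 21 could, for example, be estimated by relative frequency from a segmented corpus. The toy corpus, the function names, and the absence of smoothing are assumptions for this example; the claims do not specify how the model is trained.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories in segmented sentences."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        for a, b, c in zip(words, words[1:], words[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1  # history count: only bigrams followed by a word
    return tri, bi


def trigram_prob(tri, bi, a, b, c):
    """Maximum-likelihood estimate of P(c | a, b); 0.0 for unseen histories."""
    if bi[(a, b)] == 0:
        return 0.0
    return tri[(a, b, c)] / bi[(a, b)]


if __name__ == "__main__":
    # Hypothetical, already-segmented training sentences.
    corpus = [
        ["他", "在", "研究", "生命"],
        ["他", "在", "研究", "历史"],
        ["他", "在", "学习", "历史"],
    ]
    tri, bi = train_trigram(corpus)
    # Two context words to the left of a candidate first word.
    print(trigram_prob(tri, bi, "他", "在", "研究"))
    print(trigram_prob(tri, bi, "他", "在", "研究生"))
```

A practical model would smooth these estimates so that unseen word sequences do not receive zero probability.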

22. (Previously Presented) The method of claim 14, wherein outputting includes using 
Naive Bayesian Classifiers. 

23. (Previously Presented) The method of claim 14, wherein obtaining probability 
information comprises obtaining trigram probability information and constructing an ensemble of 
Naive Bayesian Classifiers from the trigram probability information. 

24. (Previously Presented) The method of claim 23, wherein outputting an indication 
comprises identifying one of the FMM segmentation and the BMM segmentation based on 
probability calculated from the ensemble of Naive Bayesian Classifiers. 

25. (Currently Amended) A method of segmenting a sentence of Chinese text comprising: 
with a processor; 

segmenting an input sentence of Chinese characters into constituent Chinese words
having one or more Chinese characters by performing a Forward Maximum
Matching (FMM) segmentation of the input sentence and a Backward Maximum
Matching (BMM) segmentation of the input sentence;

tokenizing the sentence of characters into known characters and at least one overlapping
ambiguity string comprising a set of at least three characters;

determining the constituent lexical words in the overlapping ambiguity string from both
the FMM and the BMM;

retaining the constituent lexical words from both the FMM and the BMM in the
tokenized sentence and removing the overlapping ambiguity string from the
tokenized sentence;

receiving probability information related to context from an N-gram language model 
comprising probability information for the constituent lexical words from both the 
FMM and the BMM and context features surrounding the overlapping ambiguity 
string, wherein the context features comprise at least one Chinese character; 

resolving the overlapping ambiguity string based on the received probability information. 

26. (Previously Presented) The method of claim 25, wherein receiving probability 
information comprises receiving probability information from a trigram language model. 

27. (Previously Presented) The method of claim 25, and further comprising generating 
an ensemble of classifiers with the received probability information. 

28. (Cancelled) 

29. (Previously Presented) The method of claim 25 and further comprising generating 
an ensemble of classifiers as a function of the N-gram model. 

30. (Previously Presented) The method of claim 29 wherein generating the ensemble 
of classifiers includes approximating probabilities of the FMM and BMM segmentations of the 
overlapping ambiguity string as being equal to the product of individual unigram probabilities of 
individual words in the FMM and BMM segmentations of the overlapping ambiguity string. 
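
Illustrative sketch, not part of the claims: the approximation in claim 30 treats the probability of a segmentation as the product of the unigram probabilities of its words. The probability values below are invented for the example, and the function name is hypothetical. Log probabilities are summed instead of multiplying raw values, which gives the same ranking.

```python
import math


def segmentation_log_prob(tokens, unigram_prob):
    """Log of the product of unigram probabilities of the given words."""
    return sum(math.log(unigram_prob[word]) for word in tokens)


if __name__ == "__main__":
    # Made-up unigram probabilities, for illustration only.
    unigram_prob = {"研究生": 0.001, "命": 0.0005, "研究": 0.004, "生命": 0.002}

    o_f = ["研究生", "命"]   # FMM segmentation of the ambiguity string
    o_b = ["研究", "生命"]   # BMM segmentation of the ambiguity string

    lp_f = segmentation_log_prob(o_f, unigram_prob)
    lp_b = segmentation_log_prob(o_b, unigram_prob)
    print("FMM" if lp_f > lp_b else "BMM")
```

With these invented values the BMM segmentation scores higher (0.004 × 0.002 against 0.001 × 0.0005).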

31. (Previously Presented) The method of claim 29, wherein generating the ensemble of
classifiers includes approximating a joint probability of a set of context features conditioned on an
existence of one of the segmentations of the overlapping ambiguity string based on a corresponding
probability of a leftmost and a rightmost word of the corresponding overlapping ambiguity string.



